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perf lets you oa 


Profi le your 
Programs! 


with low OVEM O 


And more! KY = 





What's this? 


(only Linux L) 

a 

pect on Linux is one of my Favourite 
debugging tools, It lets you: 


æ trace system calls faster than strace 

K profile your C, Go, Ctt: node.js, Rust, and Sava / SVM 
programs really easily 

Kx = trace or count almost * any* kernel event 
("perf, count how many packets every ae ees 


Tve even used it more than once to profile Ruby programs, 
so it's not just for systems wizards. 


This zine explains both how touse- tha most 
impor tant perf subcommands, and a little bit 


about how perf works under the hood . 


let me show you my favourite 
perf features + how I use itf 


Suria Eyan S 


@ bork 
https: // wns.ca 
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pert top 


My Favourite place Tto stact with pect is 
‘perf top’. 


WU 


top 















I Know how much Well ZI know how 
CPU every program) much CPU every 


Ui 
O 


pert top 


IS using Function is using | 





T like to con ` perf top" 
on machines When a program 
iS Using lOO% of the CPL 
and I don't Know why . 





As an example, let's profile a really simple 
program T wrote. T+ has On Single Pado 
Mron- hwenn uaan] which iS an infinite loop. 


Here's +he code VOid run_awesome_function () { 
int x = 0; 


LT ran. T caled while (1) { 
x=x+ 1; 


+he binary } 


use- cpu". int main() { run_awesome_function(); } 


While that s running, stact perf top. It needs 
+o rUn AS root, like every perf Subcommand. 


$ sudo perf top 


pert top out put 


Heres what it looks like when I run perf top 
when use- cpu’ iS ruming on my laptop : 


O 

100,00% use_cpu [.] run_awesome_function 
0,00% [kernel] [k] smp_call_function_single 
0,00% [kernel] [k] load_balance 


© % of the CPU the function is using 
© Name of program GC library 


© function name / symbol 


This is telling us that l00% of the CPU 
time is being spent in `run- awesome- function. 


perf top can tell you both about 
* functions in userspace programs 


x functions in the Kernel 





I 
Heres what it looks like when the kernel is using a lot of CPU: 
Kerne\ Fonctions 
27.70% [kernel] Lk] cpuidle_reflect 
11.87% libxul.so A Gel init 
10.24% [kernel] qk] J aesni_enci 


6.75% [kernel] Sef Boop Sat 


3.94% [kernel] [k] find_get_pages_contig 


this Function is doing ence ption ("aes") because 
Im writing to an enecupte Filesystem 


“@ perf mixtape w g 






pert recocd 


pert top is great for getting a quick idea of what's 
happening, bot I often want to investigate more in depth. 


collects the same information as perf top 


but it lets you save the data to analyse later. 


It saves it in a file called “perf. data" 
iN your current directory : 


Chee ER 






hey, here's some 


profiling data l 
. bea 
=e T'll save it in | pect recard | 


a File called 
perf. data 
There are 3 main Ways to choose what processtes) to 
profile with perf record : 


hale) Gee perl ee alae ata, > starł COMMAND 
ee ee l and profile it until it 


exits 
oe P| <«— profile PID until you 
@ perf pecore- PID] press ctrl +c 
E OTE Ba. 1 profile ever process 
© (perf record -ai Unio eee eee 


There's a 4th hybrid thing you can da: if you specify 
both a PID lor -a) and à Command , it'll profile the PID 
until the command exits. Like this: 


COMM 
ne a 
‘perf record -p 8325 sleep 5; 


é This useful trick lets you profile PIO 8325 for 5 seconds! 


Collect Tracing dota with perf record 


Go fac we've collected profiling data with perf: 

(“what fonction is running? ). When perf collects profiling 
data, it samples = it'll check what funclion is running say 
[0O + imes / second. 


Bot pect can also record lots of different kinds oF 
events . And when it records events, it doesn't sample -- if 


you ask it to record system calls, it'll attempt to cecard : 
every single system call. 7% 









2 list evec 
Here are a few of those events: event wi 
= system calls i perf list: 
| 


- sending network packets 
- feading from a block device (disk) i X 
= pene) switches / page Faults 
- and you can make anu kernel 
function into an event! (that's called “kprobes") 


For example, let's Say you hove a program making out bound 


netwock connectrons, bot you don't know which Program 
Or why. perf can help! 


This perf incantation records every time a program connects 
to a web server (the ‘connect! ) System call, and it also 


records the stack trace that led upto thot syscall. 


n a eee eee are a —-----------------f Stack 
l i ` trace 
‘perf record -e syscalls:sys_enter_connect -ag ; 


Being able to take a syscall / page fault / disk write and trace it 
back’ +o the exact code thal caused it is pretty magical. 


analyzing perf cecord data 


3 ways to ee a perf.data’ file generated by 
perf record: 


perf sorl quick interactive report showing 
Y po 


you which functions are used the most 


100, 00% 0,00% use_cpu use_cpu [.] main 
100, 00% 0,00% use_cpu libc-2.23.soọ [.] __libc_start_main 
100,00% 100,00% use_cpu use_cpu [.] run_awesome_function 


100% of the Ka is spent in this functi on! 


perf annotate will tell you which 


Iraia’ 
ee Ea 


U 
= 1 
N - 1 bl . ; 
; pert annotate za assem y instructions your Program 
T E Cais spending most of its time 
StRk lee ce ! 
assembly instructions! | executin (be careful, can be 
oft by one instruction) 
Disassembly of section Ytext: 
00000000004004d6 <run_awesome_function>: ae 
run_awesome_function(): X in xpyctio 
0.00 : 4004d6: push %rbp Ais 0.0 e all the t 
0.00 : 4004d7: mov %rsp,%rbp < wnet spe" 
0.00 : 4004da: movl $0x0, -0x4(%rbp) e tines bein 
100.00 : 4004e1: addl $0x1, -0x4(%rbp) i 
0.00 : 400465: jmp 4004e1 <run_awesome_function+0xb> 
Percent | Source code & Disassembly of kcore for cycles:pp 


‘perf script prints out all the 
Samples pecf collected as text so 


| 
I 
| 
I 
í you can run Scripts on the out pet 
I 

| 

I 


to do analysis. Like the f lamegraph 
script th l 
a a pagan E 


use_cpu 23001 (19774. 727477) 349732 cycles:pp: 
stack y 4e1 run_awesome_function (/home/bork/work/perf-zine/use_cpu) 
hace 4f5 main (/home/bork/work/perf-zine/use_cpu) 
20830 _ libc_start_main (/1lib/x86_64-linux-gnu/libc-2.23.s0) 
8fe258d4c544155 [unknown] ([unknown] ) 
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à n ckeock ion 


f| ameqraphs 


Flame graphs are an awesome way to visualize 
pro filing data, invented & populaciz ed by 
Brendan Gregg. 


Here's what they look like: 
l 











bamboo 23 % | bite 20% | | teeth 32% 
panda 40% alligator 60% 





main 100 % 





They're constructed from lots ( usually thousands) of stacktraces 
Sam pled from a program. This one above means that H0% of 
the stacktraces started with [anda and 32% with alligator 
teeth 
To generate Flamegraphs, get 
z github. com/ brendangregg / F lamegraph : 


and put it in your PATH. Once you have that, here's 
how to generate a Flamegraph. 


- = = m e — eel — oe 


— a = m — a — —m =e = ~ 
— — 
—_—-_— m — 


I ! 
$ sudo perf script | stackcollapse -per f.pl | 


2. _ _ _| Flamegraph.pl > graph.sva i 
gers 


1 

| 

I 
' 


Open this in your browser | 


(this is the same ‘perf script From the previous page!) 
q 


node.js _ nye 
perf T or J =e Q: 
Sava a. 
Normally with inter preted languages like node. js, perf 
will tell you which interpreter function is cunning but not which 


Sava script function is running. Bot: 


© & We can help tell perf 
sre what's going on T s ast in time 


This works because both node and Sava have a SIT 










ompiler. 
’ Ooo 
function my_cool_fun { ou know, Im © 
// do a thing actually going to 
} just -in- time compile node. 4S 


that to machine code 


SIT compiled instructions 
hey, those instructions 
correspond to the 

my- cool- foun function 





Ox affe affe 





"= Q 
Ox affe bafe node. 4S 


node communicates with perf by writing a File called 
/tmp/ perf- $PID. map 





How to set this up: 


© 


node.j S 


u 
Java 


O get perf -map- agent from qithub 
@ Find PID of process 
@) create -jova.-perf-map- sh $PID 


Node nes pect -basic -prof 
Program.4s 


Why are there kernel Functions ©? 
in my stack trace ? 


Sometimes you'll get a stack trace From perf, 
and itl mix Functions From your program 

(like _- geTdents 6U) and Functions from the 
kerene) (like btrfs_real_readdir). This is normal! 


Example: 


find 27968 97997.204322: 707897 cycles:pp: 
7fffc034eac7 read_extent_buffer ([kernel.kallsyms] ) 
7fffc032e4f7 btrfs_real_readdir ([kernel.kallsyms ] ) 
7fff81229eb8 iterate_dir ([kernel.kallsyms] ) 
7fff8122a359 sys _getdents ([kernel.kallsyms] ) 
7fff81850fc8 entry_SYSCALL_64_fastpath ([kernel.kallsyms] ) 
c88eb _ getdents64 (/1ib/x86_64-linux-gnu/libc-2.23.s¢ 


T+ usually Means either Your program did a system 
call oc there was a page Fault, and it's telling you 
exactly which kernel functions were called as a 
result of that syscall . 


For example (because I'm using the btrfs file system) 
in this case the “getdents ‘ syscall calls the 
btrfs_real_ceaddir function. Neat! 





Oh, the kerne\ isa't Magic, 
it kinda makes sense ! 


* Der F cheat sheet x 


important Command line arguments; 


Y what data to geta @ what program(s) to look atw 


-F: pick = le Frequency, -a: entire system 
97 record s$ ack traces -p: specify a PID 


-@: choose events to recor d COMMAND : run this cmd 





x perf top: get updates live Tx 


# Sample CPUs at 49 Hertz, show top symbols: 
perf top -F 49 


Cunduros 


# Sample CPUs, show top process names and segments: 
perf top -ns comm, dso 


# Count system calls by process, refreshing every 1 second: 
perf top -e raw_syscalls:sys_enter -ns comm -d 1 


# Count sent network packets by process, rolling output: 
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings 


bur 
Vane 


* pect stot : count events T CPU counters T * 


# CPU counter statistics for COMMAND: 
perf stat COMMAND 


# *Detailed* CPU counter statistics for COMMAND: 
perf stat -ddd command 





# Various basic CPU statistics, system wide: 
perf stat -e cycles, instructions, cache-misses -a 


# Count system calls for PID, until Ctrl-C: 
perf stat -e 'syscalls:sys_enter_*' -p PID 


# Count block device I/O events for the entire system, for 10 seconds: 
perf stat -e 'block:*' -a sleep 10 


k Reporting x 
# Show perf.data in an ncurses browser: 


perf report 


# Show perf.data as a text report: 
perf report --stdio 


# List all events from perf.data: 
perf script 


# Annotate assembly instructions from perf.data 
# with percentages 
perf annotate [--stdio] 
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Need kernel debuginfo 


sourced From brendangregg.com / perf.htm\ , 
which has many more great examples 


* pect trace . trace system calls & other events x 


# Trace syscalls system-wide # Trace syscalls for PID 
perf trace perf trace -p PID 


k perf record: record profiling data x ees E 
eco into 


# Sample CPU functions for COMMAND, at 99 Hertz: £ data File 
perf record -F 99 COMMAND perr.dato Tı 


# Sample CPU functions for PID, until Ctrl-C: 
perf record -p PID 


# Sample CPU functions for PID, for 10 seconds: 
perf record -p PID sleep 10 


# Sample CPU stack traces for PID, for 10 seconds: 
perf record -p PID -g -- sleep 10 


# Sample CPU stack traces for PID, using DWARF to unwind stack: 
perf record -p PID --call-graph dwarf 


x perf record : tecocd Tracing dorark 


# Trace new processes, until Ctrl-C: Sess aot 
perf record -e sched:sched_process_exec -a ; 
perf.data file 


# Trace all context-switches, until Ctrl-C: 
perf record -e context-switches -a 


# Trace all context-switches with stack traces, for 10 seconds: 
perf record -e context-switches -ag -- sleep 10 


# Trace all page faults with stack traces, until Ctrl-C: 
perf record -e page-faults -ag 


k adding new trace events * 


# Add a tracepoint for kernel function tcp_sendmsg(): 
perf probe 'tcp_sendmsg' 


# Trace previously created probe: 
perf record -e -a probe:tcp_sendmsg 


# Add a tracepoint for myfunc() return, and include the retval as a string: 
erf probe 'myfunc%return +0($retval):string' 


p 
# Trace previous probe when size > ©, and state is not TCP_ESTABLISHED(1): 
perf record -e -a probe:tcp_sendmsg --filter 'size > © && skc_state != 1' -a 


# Add a tracepoint for do_sys_open() with the filename as a string: 
perf probe 'do_sys_open filename:string' 12, 


pert stat: CPU counters LDRRAE 


If you're Writing high- performance programs, 
there ace a lot of CPU/hardware -level events 
You might be interested in counting : 


page 
Faults 


TLB 
misses 


hardware % | 
counters¥ 


Basically Linux can ask your CPU to start recording 
various statistics: 





You might wonder: 










how can T tell what 
the LI cache hit rate 

is though?? T’d need to 
look INSIDE THE CPU?E 









hey can you count] (on itY 


he hit 
L1 cache St 


misses 
As on example : here's pact of the output of ‘perf stat -ddd ls“ 
C3 is for detailed 





hey can you count 


L1 cache hits + 
misses? 






$ sudo perf stat -ddd 1s -R / 
Performance counter stats for 'ls -R /': 


3849.615096 task-clock (msec) # 0.535 CPUs utilized 
26,120 context-switches # 0.007 M/sec 
342 page-faults # 0.089 K/sec 
8,583, 744,395 cycles # 2.230 GHz 
10 10,337,612, 795 instructions # 1.20 insns per cycle 
billion 1,987,339, 660 branches # 516.244 M/sec 
instructions 20, 738,878 branch-misses 1.04% of all branches 
happen 2,883,947, 626 dTLB-loads branch # 749.152 M/sec 


Fast. 7.192555725 seconds time elapsed Prediction 
i4 stats 


perf stat : count any event 


You can actually count lots of different events with 
pect stot. The same events you can record with perf record! 


Here are a couple examples of Using ‘perf stot’ on 
Is -R (which lists Files recursively , so makes lots of syscalls) 


© count context switches between the 


kernel and userspace! 
$ sudo perf stat -e context-switches ls -R / 
Performance counter stats for 'ls -R /': 
20,821 context-switches 


@® count system calls! 


wildcard 
$ sudo perf stat -e 'syscalls:sys_enter_*' ls -R / > /dev/null 
-? 8,028 syscalls:sys_enter_newlstat 
T can these 15,167 syscalls:sys_enter_write 
YhcovaNn 254,755 syscalls:sys_enter_close 
bon 254,777 syscalls:sys_enter_open 
Sor X 509, 496 syscalls:sys_enter_newfstat 
to oe i 509,598 syscalls:sys_enter_getdents directory 
kop list entries 


perf stat does introduce some overhead. Counting “every, * 
System call for “find” made the program run up to 


6 times slower in mu brief experiments. 


L think QS long AS yov only count a Few different 


events (like just the'syscalls:sys_enter_open event) 
it should be Fine. T don't 100% understand 


why there's 
so much overhead here. though. 


pert trace 


Strace iS an awesome Linux debugging tool that- 
traces system calls. T+ has one problem though « 


Program 


pert trace traces system calls, but with way less 
overhead. T+'s safe to ron in production, unlike strace. 









oh no now T 
am running 
lOx slowec 


Lam going +o 
trace you I 


Thee are 2 disadvantages though las of Linux 4.4) 


O sometimes it drops System calls 


[this is sort of an advantage because it line 


@ it won't show you the strings thaT ace 
being read (weitien. 


Here's a comparison of both strace and perf trace 
Output, on the same program. 


u Jeani u T 
brk(brk: 0x2397000) Y - brk(0x2397000) «£ = 0x23. 
write(fd: 2</dev/pts/18>, buf: 0x23: - write(2, "bork@kiwi:~$", 13) = 13 


read(buf: Ox7ffd77b0a8d7, count: 1 - read(0, "\4", 1) 1 
ioctl(cmd: TCGETS, arg: Ox7ffd77b0ai - ioctl(@, TCGETS, {B38400 opost isig.. 
ioctl(cmd: TCSETSW, arg: Ox7ffd77b0i - ioct1(90, SNDCTL_TMR_STOP or TCSETSW, 
` I 

T hese have the same write System call bot only strace actually 
Shows you what String was written .- 

200,000? Files which was a VERY USEFUL CLUE to help Figure out 

4 
that Docker gets container sizes by looking at every file. T used perf 


trace because I didn't want to deal with strace’s overhead ¥ 
lo 


how perf works: overview 


Now that we know how to use perf, let's see 


how it works F 

The perf System is sp lit into 2 parts: 
© a program in userspace called “perf” 
© a System in the Linux Kernel 


When you ron ‘perf record’, ‘perf stat', or' perf top’ 
To get information about a program, here's what happens: 


= perf asks the kerne\ to collect information 
[E] A porfile this program) A profile this program) le this Pesan] Cont US Sey oe 
pect Veollect system calls T) \LolecF system calls T) calls I 


program 
Linux 
count network packets I kerne ì 





—> the kernel gets samples/ traces / CPU counters 
From the programs perf asks about. 


— pert displays the data back to YOU ina 
Chopefully) useful way. 


So here's the big picture: 


0—®-A— 2s 
e > aA 


programs Im 


p perf userspace r i 
natyzing 


me program kerne | 7 


on kecnel versions 


perf works really closely with the Linux kernel. This 
means a couple of things: 


— You need to install a version of perf that 
exacHy matches your kernel version. 
On Ubuntu, you can do that with: 


Sudo apt-get install linux-tools - $ uname -) 


perfs features land sometimes command ine. 
options) change between Kernel versions. 


The First version of perf was in Linux 2.6 


This also means thot there's a perf documentation folder 
in the Linux qit repository | You can see it on github: 


github.com/torvalds/linux/tree/master/tools/perf/Documentation 


Some of the cool things in there: 
- perf. data file format s} <c 


- how to use perf 's built in Python interpreter (3) 
to write scripts 


- all the man pages for each perf subcommand _ 
bench) fevlish CDir) 
sonia Waray Cts) Calk TD Gp) 


how profiling with perf works 


The Linux Kernel has a built in sampling profiler: 


A 


Linux 









Í checked what fonction the program 


WOS running 50,000 times and here 
are the results | 


How does Linux know which functions Your program iS 


running though Well -- the Linux Kernel is in charge of 
Scheduling. 


That means that atall times it has a list of every process 
and the address of the CPU instruction that Process is 
currently cunning. That address is called the instruction pointer. 


Here's what the information the Linux kernel has looks like: 


Command PIO thread 10 instruction pointer 
python 2379 2379 Ox 007354 dad 
bash 1229 1224 Ox OO 123456 
Use- cpu 494! 499 | Ox abababab 
Use- cpu 4ga 494] Ox ababbebob 


Sometimes perf can't figure out how to turn an instruction 
pointer address into a function name. Here's an exam ple of 
what that looks like: 


2% mystecious address f1 


0.00% nodejs nodejs [.] 0x0000000000759d20 
0.00% V8 WorkerThread [kernel.kallsyms] [k] hrtimer_active 


which programming languages 
Can pert profile? 
The way per f usually figures out what Function your 
programs are running cS* 
© get the program's instruction pointer address 
get a copy of the program's stack 


@ unwind the stack to find the address of 
the current fonction call 


© use the program's symbol table to figure 


out the name of the symbol that address 
Corresponds to T 


The important thing to understand is that perf will by default 
give you a symbol from the program's Symbol table 


That means perf won't Qive You function names foc binaries 
where the symbols are stripped- 


Here's how perf can help you, broken down by programming language: 


perf will tell you what 


function iS run neng 


Python, Ruby, PHP, 
o ther inter preted 
languages. 


node. 4s TVM perf will tell 
Tava /Scala / Clojure languages 


you about the 
inter preter 
perf can use an altecnate | (I 
melho Fa find the al fonction (can still be usefu 1) 
(like we explained on page 10) 






Pert : under the hood 


Tt's often useful to have a basic understanding oF 
how our tools are implemented. So let's look at the 
interface the userspace tool C'perf') uses to talk to 
the Linux kernel. Here's what happens, basically « 


© perf calls the perf_event_ open system call 
@ the kernel writes “events” to a ring buffer 
IN user space 
© perf reads events off that ring buffer and 
dis plays them to you Some how 


What's a ring buffer? 


Basically , it's important To use a limited amount 
of memory for profiling events. So the kernel allocates 
a Fixed Amount of memory : 


APA PNO O O B 


each of these is space for 1 record 
Gnd whea that memory gets full because 
New records are being written faster than perf can 


read them)... BRIAR 


a 


Linux 





whoops! were out of space , guess I 
can't write more events I 


So if you See warnings From perf about events 
being dropped, thats what's happening. 
2\ 


the perf- event_ open system call 


This system call is how perf asks the Linux kemel 
to stact sampling oc tracing. 


Here's the system cal\'s Siqnature * 


int perf_event_open(struct perf_event_attr “attr, 


pid_t pid, int,cpu, int\group_fd, 
unsigned/long{flags), 
PIOS CPU to look at. 


: g this is where most of 
Re ee enem the arguments are 
L don't Find this man page all that useful for day-to-day 


pect vsage. Butt Did you know that the ‘perf CLI tool isn't 
the only program that uses the perf_event_ open syscall? 


The ‘bec’ project i$. a toolkit For writing advanced profiling 
tools using e BPF. C) Github. com /iovisor /bcc 


With bec, you can relatively easily use perf —event_open 
to create your own custom protiling /tracing events T A 


then you can write code to aggregate / display them ang 
way you wont. 


Search BCC- PERF. OUTPUT in the becc docs to learn 
more. 


LL 


W more perf resources W 


Thanks for reading T A few more useful resources: 


7 Brendan * brendan gregg.com/perF. html e- 


Gregg's iS my favourite perf resource. His blog & 


blo 
3 talks are also useful ! 






LWN isa great Linux publication, and 
they Sometimes publish articles about perf | 






cut 


Linux Weekly News 
LWN.net 


pect has man pages GS you'd expect. 
“man perf top", For example. 


+ Pick a program and try to profile it! 
= See what your kernel is doing under diffecen} workloads! 


> Try recording ( counting a few kinds of perf events and 


see what happens I 
Qood luckf 
Ş have fun tu 
\ 


SutiA 
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like this ? 
there are more 
Zines at: 
http://jvns.ca/zines 
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